AITopics | Ongtustik Qazaqstan

Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh

Laiyk, Nurkhan, Orel, Daniil, Joshi, Rituraj, Goloburda, Maiya, Wang, Yuxia, Nakov, Preslav, Koto, Fajri

arXiv.org Artificial Intelligence, Feb-19-2025

Instruction tuning in low-resource languages remains underexplored due to limited text data, particularly in government and cultural domains. To address this, we introduce and open-source a large-scale (10,600 samples) instruction-following (IFT) dataset, covering key institutional and cultural knowledge relevant to Kazakhstan. Our dataset enhances LLMs' understanding of procedural, legal, and structural governance topics. We employ LLM-assisted data generation, comparing open-weight and closed-weight models for dataset construction, and select GPT-4o as the backbone. Each entity of our dataset undergoes full manual verification to ensure high quality. We also show that fine-tuning Qwen, Falcon, and Gemma on our dataset leads to consistent performance improvements in both multiple-choice and generative tasks, demonstrating the potential of LLM-assisted instruction tuning for low-resource languages.

dataset, instruction, kazakhstan

arXiv.org Artificial Intelligence

2502.13647

Country:

North America > United States (0.14)
Asia > Russia (0.14)
Asia > Kazakhstan > Akmola Region > Astana (0.04)
(18 more...)

Genre:

Research Report (1.00)
Personal (1.00)

Industry:

Law (1.00)
Health & Medicine (1.00)
Banking & Finance (0.93)
(6 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
